Authors:

Sam Abbott, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK

Hannah Christensen, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK

Ellen Brooks-Pollock, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK

Correspondence to: Sam Abbott, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol BS8 2BN, UK; 01173310185

Words: Title: 12 Abstract: 202 Paper: 3196

Abstract

Background

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often contain a large amount of missing data, which may not be fully accounted for in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

Methods

  • Introduce ETS
  • Data extraction and management
  • Structure of the ETS
  • Data completeness
  • Drivers of variable completeness (regression)

Results Copy from bottom

  • Missing structure
  • Drivers of variable completeness

Conclusions

  • Surveillance data are likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as….
  • To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.
  • This analysis should be repeated in other datasets - for this reason the code is available as an R package (https://doi.org/10.5281/zenodo.3492200).

Introduction

Background

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often contain a large amount of missing data, which may not be fully accounted for in analyses.

Detail

Describe the ETS and use cases

Missing data can take several forms: data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst data that are MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that are MAR or MNAR may lead to biases in analyses; however, it is not possible to deduce from the observed data alone which mechanism is driving the missingness. It is therefore necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods, such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1] Common practice is to include all variables used in the analysis in the imputation model; these variables may or may not be those at most risk of introducing bias due to an MAR mechanism.
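The distinction between MCAR and MAR can be illustrated with a small simulation. The sketch below is not part of the study: the covariate, effect sizes and missingness probabilities are all hypothetical, chosen only to show that a complete-case mean is roughly unbiased under MCAR, biased under MAR, and recoverable under MAR once the observed driver of missingness is accounted for.

```python
import random
import statistics

def simulate(n=20000, seed=1):
    """Simulate an outcome with MCAR and MAR missingness indicators (hypothetical values)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        older = rng.random() < 0.5                      # fully observed covariate
        outcome = rng.gauss(10.0 if older else 0.0, 1.0)
        mcar_missing = rng.random() < 0.4               # same chance for every record
        # MAR: the chance of being missing depends only on the observed covariate
        mar_missing = rng.random() < (0.7 if older else 0.1)
        rows.append((older, outcome, mcar_missing, mar_missing))
    return rows

def stratified_mean(rows):
    """Complete-case mean within each stratum, reweighted to the full sample."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        observed = [r[1] for r in stratum if not r[3]]
        total += statistics.fmean(observed) * len(stratum) / len(rows)
    return total

rows = simulate()
true_mean = statistics.fmean(r[1] for r in rows)
mcar_mean = statistics.fmean(r[1] for r in rows if not r[2])   # complete cases, MCAR
mar_mean = statistics.fmean(r[1] for r in rows if not r[3])    # complete cases, MAR
adjusted_mean = stratified_mean(rows)                          # MAR, adjusted for the driver

print(f"true={true_mean:.2f} mcar={mcar_mean:.2f} mar={mar_mean:.2f} adjusted={adjusted_mean:.2f}")
```

Under these assumptions the MAR complete-case mean is pulled towards the group with less missing data, while the stratified estimate stays close to the full-data mean. An MNAR mechanism could not be corrected in this way because its driver would not be observed.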

Aim

This study aims to explore the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

Methods

Enhanced tuberculosis surveillance (ETS) system

The ETS is a database that collects demographic, clinical, and microbiological data on all notified TB cases in England and is maintained by Public Health England (PHE). Notification is required by law, with health service providers having to inform PHE of all confirmed TB cases.[2] Data collection began in 2000 and was expanded, with additional variables, with the launch of a web-based system in 2008.[3] It is updated annually with de-notifications, late notifications and other updates. A descriptive analysis of TB epidemiology in England is published each year, which reports on data collection, cleaning, and trends in TB incidence at both national and sub-national levels.[2] Data on all notifications (114,820 notifications) from the ETS system from 2000 to 2015 were obtained from PHE via an application to the TB monitoring team. The code used for data cleaning is available as an R package (https://zenodo.org/badge/latestdoi/93072437).

Data completeness

As the ETS is aggregated across England, from a variety of sources, missing data are inevitable. Missing data take two forms: under-reporting of notified cases, of which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic as, apart from through comparative studies, the characteristics of those who are not notified are unknown. For variables that are missing data within the dataset the proportion of missing data can be calculated, but care must be taken to account for nested variables (such as cause of death being dependent on date of death). To account for this when estimating the proportion of missing data we have assumed that nested variables take the value of their parent variable when missing. This approach may be biased for rare outcomes (such as death in the ETS). For this reason we have also estimated the proportion of missing data by first filtering on the top level variables required for the nested variable to be defined and then computing the proportion of the remaining notifications that were missing data for the outcome of interest.
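The two ways of estimating missingness for a nested variable can give quite different answers. The following sketch uses a handful of hypothetical records (the field names `outcome` and `date_of_death` are illustrative, not the ETS variable names) to contrast replacement, where the nested variable inherits its parent's value, with filtering on the parent variable.

```python
# Hypothetical records: overall outcome is the parent, date of death the nested variable.
records = [
    {"outcome": "Died", "date_of_death": "2012-03-01"},
    {"outcome": "Died", "date_of_death": None},
    {"outcome": "Completed", "date_of_death": None},
    {"outcome": "Completed", "date_of_death": None},
    {"outcome": None, "date_of_death": None},
]

def missing_by_replacement(records):
    """Proportion missing when the nested variable inherits its parent's value."""
    missing = 0
    for r in records:
        value = r["date_of_death"]
        if value is None and r["outcome"] not in (None, "Died"):
            value = r["outcome"]   # known not to have died, so not truly missing
        missing += value is None
    return missing / len(records)

def missing_by_filtering(records):
    """Proportion missing among notifications for which the nested variable is defined."""
    died = [r for r in records if r["outcome"] == "Died"]
    return sum(r["date_of_death"] is None for r in died) / len(died)

replacement = missing_by_replacement(records)
filtering = missing_by_filtering(records)
print(replacement, filtering)
```

With replacement, the denominator is every notification, so a rare parent outcome dilutes the estimate. With filtering, the denominator is only those notifications where the nested variable could have been recorded.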

Drivers of Variable completeness

Overview

Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Here we develop a method for this and apply it to several key outcomes including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.

We reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. This approach does not account for missingness within explanatory variables.

Statistical details

In order to reformulate missing data as a logistic regression we took the following steps:

  1. For the variable of interest create a new temporary binary variable, called data status, that is “Missing” when the variable of interest is missing and “Complete” when it is not. Specify “Complete” as the baseline.

  2. For nested variables exclude notifications that do not have the top level outcome required by the variable of interest. An example of this is excluding cases that did not die, or have a missing overall outcome, when investigating TB mortality.

  3. Specify the hypothesised drivers of missingness for the variable of interest. These should be variables with a reasonable hypothesis for how they would drive missingness in the variable of interest. They must also be relatively complete as this approach does not impute missing confounder data.

  4. Fit a logistic regression model with the temporary data status variable as the outcome, adjusting for the hypothesised drivers of missingness.

  5. Exponentiate the returned coefficients, and confidence intervals so that they represent Odds Ratios (ORs).

  6. Refit the model, dropping each variable in turn and then comparing the updated model with the full model using a likelihood ratio test.

  7. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[5] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.
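Steps 1 to 6 can be sketched in a few lines. The example below is not the code from the R package: it is a standard-library Python illustration that fits the logistic regression by Newton-Raphson on hypothetical grouped counts, with two made-up binary drivers (`male`, `older`). The 1.96 multiplier assumes 95% Wald intervals, and the likelihood ratio test p value assumes one degree of freedom per dropped variable.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(groups, columns, iterations=25):
    """Logistic regression on grouped counts; returns (coefficients, std errors, log-likelihood)."""
    rows = [([1.0] + [float(cov[c]) for c in columns], k, n) for cov, k, n in groups]
    p = len(columns) + 1
    beta = [0.0] * p
    for _ in range(iterations):
        grad, loglik = [0.0] * p, 0.0
        info = [[0.0] * p for _ in range(p)]
        for x, k, n in rows:
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            loglik += k * math.log(mu) + (n - k) * math.log(1.0 - mu)
            for r in range(p):
                grad[r] += (k - n * mu) * x[r]
                for c in range(p):
                    info[r][c] += n * mu * (1.0 - mu) * x[r] * x[c]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    # At convergence the last step is ~0, so info and loglik describe the fitted model.
    se = [math.sqrt(solve(info, [float(i == j) for i in range(p)])[j]) for j in range(p)]
    return beta, se, loglik

# Steps 1-3: hypothetical counts of (drivers, number "Missing", total notifications);
# "Complete" is the baseline, so the model describes the odds of being missing.
groups = [
    ({"male": 0, "older": 0}, 200, 1000),
    ({"male": 1, "older": 0}, 260, 1000),
    ({"male": 0, "older": 1}, 400, 1000),
    ({"male": 1, "older": 1}, 470, 1000),
]
columns = ["male", "older"]

beta, se, ll_full = fit(groups, columns)                         # step 4
results = {}
for j, name in enumerate(columns, start=1):
    _, _, ll_reduced = fit(groups, [c for c in columns if c != name])  # step 6
    stat = max(2.0 * (ll_full - ll_reduced), 0.0)
    results[name] = {
        "or": math.exp(beta[j]),                                  # step 5
        "ci": (math.exp(beta[j] - 1.96 * se[j]), math.exp(beta[j] + 1.96 * se[j])),
        "p_wald": math.erfc(abs(beta[j] / se[j]) / math.sqrt(2.0)),
        "p_lrt": math.erfc(math.sqrt(stat / 2.0)),               # chi-squared, 1 df
    }
print(results)
```

Step 7 remains a matter of judgement: the odds ratios, interval widths and both sets of p values are read together rather than compared against a single cut-off.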

For all outcomes considered we adjusted for the same set of demographic variables, chosen because they were highly complete, plausibly linked to missingness for all outcomes considered, and likely to be present in other comparable surveillance datasets. These were: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Complete case analysis was used, with the dataset limited to notifications from 2010 onwards as socio-economic status was not collected prior to this. The code for this approach is available as an R package online (https://doi.org/10.5281/zenodo.3492200).

Assessing potential biases in reporting

Patient and public involvement

We did not involve patients or the public in the design or planning of this study.

Results

Data completeness

We found high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Supplementary Figure S1, Table 1). More problematically, BCG status and year of BCG vaccination had a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008.[2] Socio-economic status (as national quintiles) was not collected until 2010 but was highly complete after this point.[2] Comparing 2000-2008 with 2009-2015 in Table 1 (see also Supplementary Figure S1) shows that completeness changed over time, improving for most variables.[2,6] There was some evidence that groups of variables had correlated missing data (Supplementary Figure S1).

Table 1: Percentage of missing data (number of notifications with missing data in brackets) in the ETS for a subset of variables, before (2000-2008) and after (2009-2015) the introduction of the web-based system, ordered by the percentage missing in 2000-2008. Nested variables have been accounted for (i.e. date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.

| Variable | 2000-2008, % missing (N) | 2009-2015, % missing (N) |
|---|---|---|
| Socio-economic status (quintiles) | 100.0 (63175) | 15.7 (8120) |
| Year of BCG vaccination | 98.9 (62479) | 60.8 (31421) |
| BCG status | 98.0 (61916) | 33.2 (17133) |
| Date of diagnosis | 72.1 (45557) | 19.9 (10303) |
| Sputum smear status | 52.1 (32912) | 62.1 (32094) |
| Time since entry | 46.0 (29084) | 36.2 (18670) |
| Drug resistance | 43.5 (27485) | 40.7 (20995) |
| Occupation | 39.4 (24870) | 10.7 (5513) |
| Date of symptom onset | 37.9 (23937) | 24.8 (12829) |
| Treatment end date | 29.6 (18711) | 2.2 (1137) |
| Previous diagnosis | 20.9 (13204) | 6.1 (3148) |
| Date of starting treatment | 14.5 (9151) | 4.1 (2127) |
| Cause of death | 11.9 (7539) | 2.3 (1191) |
| UK birth status | 9.9 (6230) | 3.5 (1825) |
| Overall outcome | 9.6 (6044) | 0.0 (0) |
| Started treatment | 6.7 (4242) | 1.2 (602) |
| Ethnic group | 4.4 (2811) | 2.4 (1229) |
| Date of death | 2.0 (1235) | 0.7 (357) |
| Pulmonary or extra-pulmonary TB | 0.3 (177) | 0.4 (213) |
| Sex | 0.2 (101) | 0.2 (110) |
| Public Health England Centre | 0.1 (32) | 0.0 (0) |
| Age | 0.0 (25) | 0.0 (0) |
| Date of notification | 0.0 (0) | 0.0 (0) |
| Year | 0.0 (0) | 0.0 (0) |
| Culture | 0.0 (0) | 0.0 (0) |

By filtering nested variables - rather than by using replacement - we found the date of starting treatment was 5.9% (6434/108410) missing, which is more complete than previously estimated. For cases that were known to have completed treatment 16.5% (13804/83891) were missing a date for the end of treatment. In notifications that were known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death.

Drivers of Variable completeness

BCG status

There was evidence that BCG status was missing with an MAR mechanism for all variables considered (Table 2). BCG data missingness is strongly associated with year of notification, sex, age, ethnic group, and socio-economic status. After adjusting for other variables, data completeness increased from 2010 until 2012 but has since shown no clear trend. Men appeared to be more likely than women to have a missing BCG status, with the non-UK born also being more likely than the UK born to be missing BCG status. The proportion of those missing BCG status increased with age, with those aged 65+ being over 4 times more likely to be missing BCG status than those aged 0-14 years old. There was also evidence to suggest that notifications in the lowest socio-economic group were more likely to have a missing BCG status but there was no clear evidence of a trend across socio-economic quintiles. The White ethnic group was more likely to have a missing BCG status than any other ethnic group.

Table 2: Results from a logistic regression model with data completeness (Complete/Missing) for BCG vaccination as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables; the first category listed for each variable is the baseline. The model indicates that BCG status is missing at random for the variables considered.

| Variable | Category | Missing, % (N) | Notifications (N = 41659) | Odds Ratio (CI) | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 31.3% (2235) | 7143 | | | 1.6e-08 |
| | 2011 | 29.8% (2319) | 7781 | 0.94 (0.88, 1.01) | 0.107 | |
| | 2012 | 27.9% (2164) | 7755 | 0.85 (0.79, 0.92) | 1.93e-05 | |
| | 2013 | 27.1% (1907) | 7034 | 0.79 (0.73, 0.85) | 1.3e-09 | |
| | 2014 | 30.1% (1907) | 6327 | 0.90 (0.83, 0.97) | 0.00672 | |
| | 2015 | 29.7% (1668) | 5619 | 0.88 (0.81, 0.95) | 0.00104 | |
| Sex | Female | 27.4% (4847) | 17664 | | | 5.21e-14 |
| | Male | 30.6% (7353) | 23995 | 1.19 (1.14, 1.24) | 5.97e-14 | |
| Age | 0-14 | 13.1% (235) | 1793 | | | 8.49e-162 |
| | 15-44 | 26.0% (6557) | 25235 | 2.24 (1.94, 2.60) | 5.72e-27 | |
| | 45-64 | 32.8% (2964) | 9026 | 3.05 (2.63, 3.55) | 3.38e-47 | |
| | 65+ | 43.6% (2444) | 5605 | 4.82 (4.13, 5.64) | 1.93e-87 | |
| Ethnic group | White | 35.4% (2959) | 8359 | | | 1.18e-14 |
| | Black-Caribbean | 24.6% (228) | 928 | 0.88 (0.74, 1.03) | 0.124 | |
| | Black-African | 27.3% (1966) | 7204 | 0.87 (0.79, 0.95) | 0.00235 | |
| | Black-Other | 24.1% (89) | 369 | 0.87 (0.67, 1.12) | 0.275 | |
| | Indian | 25.9% (2805) | 10848 | 0.71 (0.65, 0.77) | 3.69e-16 | |
| | Pakistani | 33.2% (2258) | 6806 | 0.85 (0.78, 0.93) | 0.000209 | |
| | Bangladeshi | 27.9% (469) | 1680 | 0.92 (0.81, 1.05) | 0.214 | |
| | Chinese | 33.6% (166) | 494 | 0.91 (0.74, 1.12) | 0.395 | |
| | Mixed / Other | 25.3% (1260) | 4971 | 0.80 (0.72, 0.88) | 5.15e-06 | |
| UK birth status | Non-UK Born | 29.5% (9104) | 30880 | | | 7.78e-28 |
| | UK Born | 28.7% (3096) | 10779 | 0.68 (0.63, 0.73) | 2.69e-27 | |
| Socio-economic status | 1 | 30.7% (4948) | 16131 | | | 0.0647 |
| | 2 | 26.8% (3383) | 12621 | 1.01 (0.95, 1.07) | 0.825 | |
| | 3 | 29.2% (1905) | 6530 | 1.09 (1.01, 1.16) | 0.0187 | |
| | 4 | 30.1% (1142) | 3796 | 0.98 (0.90, 1.06) | 0.616 | |
| | 5 | 31.8% (822) | 2581 | 0.96 (0.87, 1.06) | 0.415 | |
| Public Health England centre | London | 21.0% (3716) | 17658 | | | 0 |
| | West Midlands | 22.4% (1171) | 5217 | | | |
| | North West | 51.8% (2112) | 4075 | | | |
| | South East | 26.6% (1074) | 4037 | | | |
| | Yorkshire and the Humber | 37.0% (1138) | 3077 | | | |
| | East of England | 36.4% (969) | 2662 | | | |
| | East Midlands | 45.3% (1154) | 2548 | | | |
| | South West | 41.2% (657) | 1595 | | | |
| | North East | 26.5% (209) | 790 | | | |
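As a rough cross-check on how much adjustment matters, crude (unadjusted) odds ratios can be computed directly from the counts of missing BCG status and notifications reported in Table 2. This is an illustrative calculation rather than part of the original analysis, and the helper name `crude_odds_ratio` is ours.

```python
def crude_odds_ratio(missing_group, total_group, missing_baseline, total_baseline):
    """Unadjusted odds ratio of being missing, for a group relative to a baseline."""
    odds_group = missing_group / (total_group - missing_group)
    odds_baseline = missing_baseline / (total_baseline - missing_baseline)
    return odds_group / odds_baseline

# Counts taken from Table 2: (missing, notifications) for each category.
male_vs_female = crude_odds_ratio(7353, 23995, 4847, 17664)
age_65_vs_0_14 = crude_odds_ratio(2444, 5605, 235, 1793)
uk_vs_non_uk_born = crude_odds_ratio(3096, 10779, 9104, 30880)

print(f"{male_vs_female:.2f} {age_65_vs_0_14:.2f} {uk_vs_non_uk_born:.2f}")
```

The crude values are roughly 1.17, 5.1 and 0.96, against adjusted odds ratios of 1.19, 4.82 and 0.68 in Table 2. The gap for UK birth status suggests that the crude comparison is confounded by the other variables in the model, which is the motivation for adjusting for them jointly.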

Year of BCG vaccination

As for BCG status, year of BCG vaccination was also clearly missing with MAR mechanisms for the variables considered (Supplementary Table S1). As for BCG status, men were more likely to have a missing year of BCG vaccination, as were the non-UK born. Older notifications were again more likely to have missing data, with those aged 65+ being more than 2 times more likely to have a missing year of vaccination. However, unlike BCG vaccination status, year of notification showed a clear trend of increasing data completeness from 2010 until 2015. Additionally, for year of BCG vaccination the White ethnic group was more likely to have complete data than any other ethnic group, with those of Black-Caribbean descent being over 3 times more likely to have a missing year of BCG vaccination. Socio-economic status was highly associated with year of vaccination being missing but there was little clear evidence of a trend. The second and third poorest quintiles were more likely to have a missing year of vaccination, whilst the richest and second richest quintiles were less likely to have a missing year of vaccination.

Date of symptom onset

For date of symptom onset there was strong evidence of an MAR mechanism for all variables considered, except for sex (Table 3). The likelihood of date of symptom onset being missing reduced with year of notification. Children (0-14 years old) were more likely to have a missing date of symptom onset than any other age group as were those in any socio-economic quintile when compared to the poorest group. UK born cases were more likely to have a complete date of symptom onset than non-UK born cases, with the White ethnic group being more likely to have a missing date of symptom onset than most other ethnic groups.

Table 3: Results from a logistic regression model with data completeness (Complete/Missing) for date of symptom onset as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables; the first category listed for each variable is the baseline. The model indicates that date of symptom onset is missing at random for all variables considered, except for sex.

| Variable | Category | Missing, % (N) | Notifications (N = 41659) | Odds Ratio (CI) | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 34.0% (2426) | 7143 | | | 0 |
| | 2011 | 30.1% (2339) | 7781 | 0.84 (0.78, 0.90) | 1.45e-06 | |
| | 2012 | 24.2% (1878) | 7755 | 0.61 (0.57, 0.66) | 1.73e-38 | |
| | 2013 | 17.5% (1233) | 7034 | 0.41 (0.37, 0.44) | 2.6e-105 | |
| | 2014 | 11.8% (744) | 6327 | 0.25 (0.23, 0.27) | 6.1e-187 | |
| | 2015 | 6.9% (390) | 5619 | 0.14 (0.12, 0.15) | 1.7e-245 | |
| Sex | Female | 22.0% (3894) | 17664 | | | 0.363 |
| | Male | 21.3% (5116) | 23995 | 0.98 (0.93, 1.03) | 0.363 | |
| Age | 0-14 | 38.1% (684) | 1793 | | | 6.9e-78 |
| | 15-44 | 20.5% (5182) | 25235 | 0.33 (0.30, 0.38) | 4.33e-78 | |
| | 45-64 | 20.7% (1870) | 9026 | 0.36 (0.32, 0.41) | 4.15e-58 | |
| | 65+ | 22.7% (1274) | 5605 | 0.44 (0.39, 0.51) | 3.41e-34 | |
| Ethnic group | White | 20.9% (1749) | 8359 | | | 1.53e-08 |
| | Black-Caribbean | 23.1% (214) | 928 | 0.76 (0.64, 0.90) | 0.00216 | |
| | Black-African | 23.0% (1654) | 7204 | 0.72 (0.65, 0.79) | 7.47e-11 | |
| | Black-Other | 18.7% (69) | 369 | 0.61 (0.45, 0.80) | 0.000611 | |
| | Indian | 22.2% (2404) | 10848 | 0.76 (0.70, 0.84) | 1.17e-08 | |
| | Pakistani | 19.2% (1305) | 6806 | 0.79 (0.72, 0.87) | 3.23e-06 | |
| | Bangladeshi | 23.9% (401) | 1680 | 0.80 (0.69, 0.92) | 0.00178 | |
| | Chinese | 18.8% (93) | 494 | 0.68 (0.53, 0.87) | 0.0025 | |
| | Mixed / Other | 22.6% (1121) | 4971 | 0.79 (0.71, 0.88) | 1.07e-05 | |
| UK birth status | Non-UK Born | 21.9% (6774) | 30880 | | | 0.000152 |
| | UK Born | 20.7% (2236) | 10779 | 0.86 (0.80, 0.93) | 0.00016 | |
| Socio-economic status | 1 | 19.9% (3218) | 16131 | | | 1.06e-06 |
| | 2 | 22.9% (2888) | 12621 | 0.98 (0.92, 1.05) | 0.63 | |
| | 3 | 24.2% (1578) | 6530 | 1.17 (1.08, 1.26) | 7.32e-05 | |
| | 4 | 22.0% (837) | 3796 | 1.18 (1.07, 1.29) | 0.000845 | |
| | 5 | 18.9% (489) | 2581 | 1.17 (1.04, 1.31) | 0.01 | |
| Public Health England centre | London | 30.0% (5289) | 17658 | | | 0 |
| | West Midlands | 12.0% (627) | 5217 | | | |
| | North West | 20.6% (841) | 4075 | | | |
| | South East | 9.0% (363) | 4037 | | | |
| | Yorkshire and the Humber | 13.2% (407) | 3077 | | | |
| | East of England | 26.5% (705) | 2662 | | | |
| | East Midlands | 19.2% (488) | 2548 | | | |
| | South West | 10.9% (174) | 1595 | | | |
| | North East | 14.7% (116) | 790 | | | |

Date of diagnosis

For date of diagnosis there was again strong evidence for an MAR mechanism for all variables considered, except for sex, for which there was very weak evidence (Supplementary Table S2). Increasing completeness was found for year of notification as seen previously, as was an increased likelihood of missing data in males and the non-UK born. The White ethnic group was less likely to be missing data on the date of diagnosis compared to the majority of other ethnic groups, as was the poorest socio-economic group compared to all other socio-economic quintiles. Children (0-14 years old) were again more likely to be missing data than adults in any age group.

Date of starting treatment and ending treatment

For date of starting treatment there was little evidence that missing data is associated with any variable considered, except for year of notification (Supplementary Table S3). Variable completeness improved year-on-year with a 96% drop in missing data in 2015 compared to 2010. Missing data for the date of ending treatment had a comparable association with the year of notification but also had weak evidence of an association with ethnic group and socio-economic status (Supplementary Table S4). There was some evidence that the poorest socio-economic group was more likely to be missing the date of ending treatment but the evidence for this was mixed. The White ethnic group was slightly more likely to be missing the date of treatment ending than most other ethnic groups.

Date of death

For date of death there was some evidence that data were missing with an MAR mechanism for ethnic group and socio-economic status, with little evidence for any other association (Supplementary Table S5). These associations should be interpreted carefully due to the strength of the evidence when compared to the number of tests conducted. Whilst the confidence intervals were wide for all ethnic groups there was some weak indication that the White ethnic group was more likely to have a complete date of death than other ethnic groups. Similarly, those in the lowest socio-economic group were somewhat more likely to have a complete date of death than other quintiles. The reduction in the level of evidence found for date of death may be linked to the reduction in power for this outcome, as mortality is a rare outcome.

Cause of death

For cause of death there was less evidence of an MAR mechanism, with little evidence of an association for year, sex, age, or socio-economic group (Supplementary Table S6). However, there was evidence of an association with ethnic group and very weak evidence of an association with UK birth status. The White ethnic group was less likely to have an incomplete cause of death when compared to the majority of other identified ethnic groups but there was weak evidence to suggest that cause of death was in fact less likely to be missing in those identifying as being of Black-Caribbean, Black-Other, Indian and Bangladeshi descent. The confidence intervals for these estimates were wide, indicating that these estimates may not be reliable. There was again some weak evidence to suggest that the UK born were more likely to be missing a cause of death than the non-UK born, which reverses the trend observed in the other variables explored. As for the date of death, cause of death had a small sample size, which may mean that this analysis was underpowered.

Assessing potential biases in reporting

  • Describe pattern in notifications by month and within months.
  • Describe which variables also follow this pattern.
  • Discuss variables with potential recall bias.

It is also likely that some of the dates recorded are inaccurate or systematically biased.

Date of notification can be used as a baseline against which to judge other date variables.

There is some evidence of a seasonal trend in notifications, with a higher proportion of cases notified in May, June and July than in the rest of the year. This seasonality would have to be accounted for if conducting analysis on a monthly scale with date of notification used as the date of first contact with the health system. There is little evidence that date of notification varies by day of the month.

Date of symptom onset represents the closest approximation to the date when a case became infectious. Unfortunately there are multiple issues with this measure, the first being that it is not complete across the data extract.

The date of symptom onset is highly susceptible to recall bias with the majority of cases becoming symptomatic on the first of each month, with some evidence that a greater number of cases occur in January than would be expected.

Another possible measure of the number of cases is the date of diagnosis. This should be a more reliable variable than the date of symptom onset, as it does not rely on the recall of the case.

The date of starting treatment should be a more reliable contact date as it records an official contact with the health system. As for the date of notification, there is some evidence of a seasonal trend for date of starting treatment, with a peak of cases starting treatment in May, June and July. However, this seasonal trend is difficult to identify when cases starting treatment are visualised by month over time. Unlike the date of symptom onset there is little evidence of recall bias by month, or by day.

The date of ending treatment does not appear to display similar seasonality. This may be because treatment time varies between individuals, which dilutes the seasonality observed for the date of starting treatment. As noted previously, there was some evidence of recall bias when the proportion of those ending treatment was examined on a day of the month basis, with a larger proportion ending treatment on the first of the month than on any other day. The date of ending treatment was not recorded in 2000, or 2001, and was highly missing for the first several years after collection began.
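The checks described in this section amount to tabulating dates by month and by day of the month and comparing the share on the first of the month with what an even spread would give. The sketch below shows one way to do this; the dates are simulated, with a hypothetical 20% of records rounded to the first of the month, and none of the function names come from the analysis code.

```python
import random
from collections import Counter
from datetime import date, timedelta

def first_of_month_share(dates):
    """Return (observed, expected) share of dates falling on the 1st of a month."""
    observed = sum(d.day == 1 for d in dates) / len(dates)
    # Expected share if dates were spread evenly over the calendar days they span.
    start, end = min(dates), max(dates)
    span = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    expected = sum(d.day == 1 for d in span) / len(span)
    return observed, expected

def monthly_share(dates):
    """Share of dates in each calendar month, for inspecting seasonality."""
    counts = Counter(d.month for d in dates)
    return {month: counts.get(month, 0) / len(dates) for month in range(1, 13)}

# Hypothetical data: uniform dates over five years, with 20% heaped on the 1st.
rng = random.Random(42)
start = date(2010, 1, 1)
dates = []
for _ in range(5000):
    d = start + timedelta(days=rng.randrange(365 * 5))
    if rng.random() < 0.2:
        d = d.replace(day=1)   # mimics a recalled date recorded as the 1st
    dates.append(d)

observed, expected = first_of_month_share(dates)
shares = monthly_share(dates)
print(f"observed={observed:.3f} expected={expected:.3f}")
```

A large gap between the observed and expected share points to heaping of the kind attributed here to recall bias, whereas the monthly shares are the natural starting point for judging seasonality.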

Figure 1: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected, which is some evidence of recall bias. Data are only shown from 2001 until 2015, as this variable was not recorded prior to 2001; data for 2015 are not complete.


Figure 2: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected, which is some evidence of recall bias. Data are only shown from 2001 until 2015, as this variable was not recorded prior to 2001; data for 2015 are not complete.


Discussion

Statement of primary findings

In the ETS system we found a high degree of missing data for several important variables. We also found that there is likely to be a strong missing at random (MAR) mechanism underlying these missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in studies using these data to avoid introducing bias. We found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood.

The following analysis is not currently in the paper but it was in the chapter - is there a case for including?

We also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend.

Strengths and limitations of the study

Work in progress - copied from chapter text

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[7] Additionally, as the data have not been collected with a specific analysis in mind there may be issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possible, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period. Research questions must therefore either be limited to active TB patients or, when extended to the general population, account for the differing population demographics. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status, which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[8] Validation studies would be required to account for this.

Unlike classic approaches to missing data, such as multiple imputation by chained equations (MICE),[9] this is not an imputation

Strengths and limitations in comparison to the literature

Meaning of the study

  • Surveillance data are likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as….
  • To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.

Unanswered questions and future research

Acknowledgements

The authors thank the TB section at Public Health England (PHE) for maintaining the Enhanced Tuberculosis Surveillance (ETS) system, and all the healthcare workers involved in data collection for the ETS.

Contributors

SA conceived and designed the work. SA undertook the analysis with advice from all other authors. All authors contributed to the interpretation of the data. SA wrote the first draft of the paper and all authors contributed to subsequent drafts. All authors approve the work for publication and agree to be accountable for the work.

Funding

SEA, HC, and EBP are funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Evaluation of Interventions at University of Bristol in partnership with Public Health England (PHE). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England.

Conflicts of interest

HC reports receiving honoraria from Sanofi Pasteur, and consultancy fees from AstraZeneca, GSK and IMS Health, all paid to her employer.

Accessibility of programming code

The code for the analysis contained in this paper can be found at: https://doi.org/10.5281/zenodo.3492200

References

1 Sterne JAC, White IR, Carlin JB et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393–3.

2 Public Health England. Tuberculosis in England 2017 report (presenting data to end of 2016). 2017.

3 Kruijshaar M, French C, Anderson C et al. Tuberculosis in the UK, Annual report on tuberculosis surveillance and control in the UK 2007. Thorax 2007;50:703–3.

4 Pillaye J, Clarke A. An evaluation of completeness of tuberculosis notification in the United Kingdom. BMC Public Health 2003;3:31.

5 Sterne JA, Davey Smith G. Sifting the evidence - what’s wrong with significance tests? BMJ 2001;322:226–31.

6 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

7 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

8 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

9 Groothuis-Oudshoorn K. MICE: Multivariate Imputation by Chained Equations. Journal of Statistical Software;VV.

Results Copy to top

Supplementary Information: Explore and Evaluate the Mechanisms for Missing Data in the Enhanced Tuberculosis Surveillance System

Sam Abbott, Hannah Christensen, Ellen Brooks-Pollock

Data completeness

Supplementary Figure S1: Summary plot of missing data in the extract of the ETS data used in this study. Due to the large size of the dataset, the data has been sub-sampled, with only 20% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables is shown: year (year), sex (sex), age (age), PHE Centre (phec), occupation (occat), ethnic group (ethgrp), UK birth status (ukborn), time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status (bcgvacc), year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e. date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.
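
The handling of nested variables described in the caption can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the figure. The records, the category labels ("Died", "Completed") and the helper names are hypothetical; only the variable names overalloutcome and dateofdeath are taken from the caption.

```python
# Hypothetical records; None marks a missing value. The category labels
# used for overalloutcome are assumptions made for this sketch.
notifications = [
    {"overalloutcome": "Died", "dateofdeath": "2012-03-01"},
    {"overalloutcome": "Died", "dateofdeath": None},        # truly missing
    {"overalloutcome": "Completed", "dateofdeath": None},   # not applicable
    {"overalloutcome": "Completed", "dateofdeath": None},   # not applicable
    {"overalloutcome": None, "dateofdeath": None},          # outcome unknown
]

NOT_APPLICABLE = "Not applicable"


def resolve_nested(record):
    """Fill date of death with a placeholder for cases known not to have died."""
    outcome = record["overalloutcome"]
    death_date = record["dateofdeath"]
    if death_date is None and outcome is not None and outcome != "Died":
        return NOT_APPLICABLE
    return death_date


def proportion_missing(values):
    values = list(values)
    return sum(v is None for v in values) / len(values)


naive = proportion_missing(r["dateofdeath"] for r in notifications)
nested = proportion_missing(resolve_nested(r) for r in notifications)
print(f"naive: {naive:.0%}, accounting for nesting: {nested:.0%}")
```

In this toy example, counting every empty date of death as missing overstates missingness, because most empty entries belong to cases for whom the variable does not apply. Cases with an unknown overall outcome are still counted as missing, since it cannot be established whether a date of death should exist.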

Drivers of data completeness - additional results tables

Year of BCG vaccination

Supplementary Table S1: Results from a logistic regression model with data completeness (Complete/Missing) for year of BCG vaccination as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that year of BCG vaccination is missing at random for the variables considered.
| Variable | Category | Missing, % (N) | Notifications (20835) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 61.0% (2090) | 3424 | | | 1.59e-09 |
| | 2011 | 59.6% (2304) | 3869 | 0.90 (0.79, 1.03) | 0.134 | |
| | 2012 | 56.2% (2216) | 3945 | 0.73 (0.64, 0.84) | 6.21e-06 | |
| | 2013 | 55.7% (2025) | 3638 | 0.75 (0.65, 0.86) | 2.71e-05 | |
| | 2014 | 56.6% (1776) | 3138 | 0.83 (0.72, 0.95) | 0.00891 | |
| | 2015 | 54.2% (1530) | 2821 | 0.64 (0.55, 0.74) | 1.34e-09 | |
| Sex | Female | 55.5% (5089) | 9174 | | | 0.275 |
| | Male | 58.8% (6852) | 11661 | 1.05 (0.97, 1.13) | 0.275 | |
| Age | 0-14 | 43.9% (488) | 1111 | | | 1.21e-20 |
| | 15-44 | 58.3% (8216) | 14102 | 2.12 (1.77, 2.53) | 1.38e-16 | |
| | 45-64 | 57.6% (2526) | 4388 | 2.42 (1.99, 2.94) | 6.72e-19 | |
| | 65+ | 57.6% (711) | 1234 | 3.00 (2.36, 3.83) | 5.09e-19 | |
| Ethnic group | White | 44.2% (1370) | 3102 | | | 5.86e-12 |
| | Black-Caribbean | 77.5% (371) | 479 | 1.19 (0.89, 1.61) | 0.242 | |
| | Black-African | 65.2% (2524) | 3870 | 0.91 (0.78, 1.07) | 0.261 | |
| | Black-Other | 72.0% (154) | 214 | 1.23 (0.80, 1.90) | 0.349 | |
| | Indian | 56.1% (3516) | 6267 | 0.75 (0.65, 0.86) | 7.27e-05 | |
| | Pakistani | 51.6% (1583) | 3066 | 1.10 (0.95, 1.28) | 0.205 | |
| | Bangladeshi | 73.1% (583) | 797 | 1.48 (1.15, 1.90) | 0.00226 | |
| | Chinese | 58.2% (142) | 244 | 1.23 (0.83, 1.80) | 0.3 | |
| | Mixed / Other | 60.7% (1698) | 2796 | 0.83 (0.70, 0.98) | 0.0318 | |
| UK birth status | Non-UK Born | 61.1% (9665) | 15808 | | | 5.14e-08 |
| | UK Born | 45.3% (2276) | 5027 | 0.74 (0.66, 0.82) | 4.98e-08 | |
| Socio-economic status | 1 | 55.4% (4221) | 7615 | | | 4.64e-05 |
| | 2 | 66.3% (4463) | 6729 | 0.88 (0.79, 0.97) | 0.0118 | |
| | 3 | 59.4% (2019) | 3401 | 0.84 (0.74, 0.95) | 0.00684 | |
| | 4 | 45.3% (838) | 1848 | 0.70 (0.60, 0.82) | 6.29e-06 | |
| | 5 | 32.2% (400) | 1242 | 0.78 (0.65, 0.93) | 0.00583 | |
| Public Health England centre | London | 91.0% (9421) | 10358 | | | 0 |
| | West Midlands | 39.3% (1010) | 2568 | | | |
| | North West | 9.2% (116) | 1260 | | | |
| | South East | 13.0% (293) | 2261 | | | |
| | Yorkshire and the Humber | 45.2% (528) | 1167 | | | |
| | East of England | 19.9% (260) | 1305 | | | |
| | East Midlands | 3.1% (33) | 1066 | | | |
| | South West | 38.4% (175) | 456 | | | |
| | North East | 26.6% (105) | 394 | | | |
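
To make the odds ratios in Supplementary Table S1 easier to interpret, the sketch below computes a crude (unadjusted) odds ratio for missingness with a Wald confidence interval, using the UK birth status counts reported in the table. It is a minimal Python illustration, not the model used in this study: the function name is hypothetical, and the result is not expected to match the adjusted odds ratio in the table, which comes from a multivariable logistic regression.

```python
import math


def crude_odds_ratio(missing_exposed, total_exposed, missing_ref, total_ref):
    """Unadjusted odds ratio for missingness with a Wald 95% interval."""
    a = missing_exposed                  # comparison category, missing
    b = total_exposed - missing_exposed  # comparison category, complete
    c = missing_ref                      # reference category, missing
    d = total_ref - missing_ref          # reference category, complete
    odds_ratio = (a / b) / (c / d)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.96  # approximate 97.5th percentile of the standard normal
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from Supplementary Table S1: UK born (2276 missing of 5027)
# against the reference category, non-UK born (9665 missing of 15808).
or_, lo, hi = crude_odds_ratio(2276, 5027, 9665, 15808)
print(f"crude OR {or_:.2f} ({lo:.2f}, {hi:.2f})")
```

The crude estimate lies further from 1 than the adjusted odds ratio of 0.74 reported in the table, which is what would be expected if part of the crude association is explained by the other variables in the model.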

Date of diagnosis

Supplementary Table S2: Results from a logistic regression model with data completeness (Complete/Missing) for date of diagnosis as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that date of diagnosis is missing at random for all variables considered, except for sex.
| Variable | Category | Missing, % (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 26.9% (1918) | 7143 | | | 7.54e-286 |
| | 2011 | 22.3% (1736) | 7781 | 0.77 (0.71, 0.83) | 2.11e-10 | |
| | 2012 | 18.8% (1458) | 7755 | 0.61 (0.56, 0.66) | 3.93e-31 | |
| | 2013 | 12.9% (909) | 7034 | 0.38 (0.35, 0.42) | 6.81e-91 | |
| | 2014 | 10.4% (659) | 6327 | 0.30 (0.27, 0.33) | 6.2e-120 | |
| | 2015 | 7.4% (415) | 5619 | 0.20 (0.18, 0.22) | 1.56e-158 | |
| Sex | Female | 16.9% (2984) | 17664 | | | 0.432 |
| | Male | 17.1% (4111) | 23995 | 1.02 (0.97, 1.08) | 0.432 | |
| Age | 0-14 | 19.4% (348) | 1793 | | | 0.000251 |
| | 15-44 | 17.8% (4504) | 25235 | 0.74 (0.65, 0.86) | 4.77e-05 | |
| | 45-64 | 15.9% (1434) | 9026 | 0.73 (0.62, 0.85) | 3.52e-05 | |
| | 65+ | 14.4% (809) | 5605 | 0.79 (0.68, 0.94) | 0.00563 | |
| Ethnic group | White | 12.5% (1043) | 8359 | | | 6.85e-08 |
| | Black-Caribbean | 25.2% (234) | 928 | 1.20 (1.00, 1.43) | 0.0469 | |
| | Black-African | 21.9% (1577) | 7204 | 0.99 (0.89, 1.11) | 0.876 | |
| | Black-Other | 17.9% (66) | 369 | 0.75 (0.56, 1.01) | 0.0612 | |
| | Indian | 18.0% (1957) | 10848 | 0.80 (0.72, 0.89) | 4.94e-05 | |
| | Pakistani | 11.8% (805) | 6806 | 0.86 (0.76, 0.97) | 0.0158 | |
| | Bangladeshi | 21.5% (361) | 1680 | 0.94 (0.81, 1.10) | 0.469 | |
| | Chinese | 13.4% (66) | 494 | 0.66 (0.49, 0.88) | 0.00525 | |
| | Mixed / Other | 19.8% (986) | 4971 | 0.91 (0.81, 1.02) | 0.117 | |
| UK birth status | Non-UK Born | 18.4% (5696) | 30880 | | | 0.00227 |
| | UK Born | 13.0% (1399) | 10779 | 0.87 (0.80, 0.95) | 0.00235 | |
| Socio-economic status | 1 | 14.4% (2317) | 16131 | | | 6.01e-14 |
| | 2 | 19.6% (2469) | 12621 | 0.97 (0.90, 1.04) | 0.394 | |
| | 3 | 20.3% (1325) | 6530 | 1.22 (1.12, 1.33) | 5.3e-06 | |
| | 4 | 17.0% (645) | 3796 | 1.30 (1.17, 1.45) | 1.87e-06 | |
| | 5 | 13.1% (339) | 2581 | 1.42 (1.23, 1.62) | 9.74e-07 | |
| Public Health England centre | London | 31.0% (5471) | 17658 | | | 0 |
| | West Midlands | 3.6% (190) | 5217 | | | |
| | North West | 7.6% (308) | 4075 | | | |
| | South East | 3.9% (157) | 4037 | | | |
| | Yorkshire and the Humber | 3.2% (99) | 3077 | | | |
| | East of England | 11.3% (302) | 2662 | | | |
| | East Midlands | 18.9% (482) | 2548 | | | |
| | South West | 2.8% (45) | 1595 | | | |
| | North East | 5.2% (41) | 790 | | | |
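
The likelihood ratio test (LRT) column can be illustrated for the simplest case of a single two-level predictor. The following is a minimal Python sketch, not the adjusted model reported in Supplementary Table S2: it compares an intercept-only logistic model with one that includes sex alone, using the counts for sex from the table. The function name is hypothetical, and the resulting p value is unadjusted, so it is not expected to equal the value in the table.

```python
import math


def lrt_binary_predictor(missing_a, total_a, missing_b, total_b):
    """Likelihood ratio test for one two-level predictor of missingness.

    With a single categorical predictor the fitted probabilities of both
    models are observed proportions, so no iterative fitting is needed.
    """
    def log_lik(missing, total, p):
        return missing * math.log(p) + (total - missing) * math.log(1 - p)

    p_null = (missing_a + missing_b) / (total_a + total_b)
    ll_null = (log_lik(missing_a, total_a, p_null)
               + log_lik(missing_b, total_b, p_null))
    ll_full = (log_lik(missing_a, total_a, missing_a / total_a)
               + log_lik(missing_b, total_b, missing_b / total_b))
    # Guard against a tiny negative value from floating point rounding.
    statistic = max(2 * (ll_full - ll_null), 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from Supplementary Table S2: female 2984 missing of 17664,
# male 4111 missing of 23995.
statistic, p_value = lrt_binary_predictor(2984, 17664, 4111, 23995)
print(f"LRT statistic {statistic:.3f}, p = {p_value:.3f}")
```

For predictors with more than two levels the same comparison applies, with degrees of freedom equal to the number of levels minus one; the closed-form survival function used here only covers the one degree of freedom case.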

Date of starting treatment and ending treatment

Supplementary Table S3: Results from a logistic regression model with data completeness (Complete/Missing) for date of starting treatment as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. There is little evidence that the missing data for the date of starting treatment is associated with any variable considered, except for year of notification.
| Variable | Category | Missing, % (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 5.1% (367) | 7143 | | | 2.48e-37 |
| | 2011 | 4.7% (368) | 7781 | 0.92 (0.79, 1.07) | 0.281 | |
| | 2012 | 4.0% (314) | 7755 | 0.77 (0.66, 0.90) | 0.00121 | |
| | 2013 | 3.8% (265) | 7034 | 0.70 (0.59, 0.82) | 1.7e-05 | |
| | 2014 | 2.2% (139) | 6327 | 0.39 (0.32, 0.47) | 1.36e-20 | |
| | 2015 | 2.0% (115) | 5619 | 0.36 (0.29, 0.45) | 1.65e-20 | |
| Sex | Female | 3.4% (608) | 17664 | | | 0.00223 |
| | Male | 4.0% (960) | 23995 | 1.18 (1.06, 1.31) | 0.00234 | |
| Age | 0-14 | 3.6% (64) | 1793 | | | 1.89e-29 |
| | 15-44 | 3.1% (774) | 25235 | 0.89 (0.68, 1.17) | 0.384 | |
| | 45-64 | 3.4% (310) | 9026 | 0.93 (0.70, 1.25) | 0.628 | |
| | 65+ | 7.5% (420) | 5605 | 1.96 (1.49, 2.63) | 3.16e-06 | |
| Ethnic group | White | 5.8% (486) | 8359 | | | 0.00077 |
| | Black-Caribbean | 3.4% (32) | 928 | 0.71 (0.48, 1.02) | 0.0765 | |
| | Black-African | 2.8% (203) | 7204 | 0.61 (0.49, 0.76) | 7.46e-06 | |
| | Black-Other | 3.3% (12) | 369 | 0.79 (0.42, 1.38) | 0.445 | |
| | Indian | 3.4% (371) | 10848 | 0.71 (0.59, 0.86) | 0.000401 | |
| | Pakistani | 3.6% (243) | 6806 | 0.63 (0.52, 0.77) | 4.66e-06 | |
| | Bangladeshi | 3.1% (52) | 1680 | 0.66 (0.48, 0.90) | 0.0108 | |
| | Chinese | 3.8% (19) | 494 | 0.78 (0.46, 1.24) | 0.318 | |
| | Mixed / Other | 3.0% (150) | 4971 | 0.70 (0.55, 0.87) | 0.00173 | |
| UK birth status | Non-UK Born | 3.4% (1045) | 30880 | | | 0.516 |
| | UK Born | 4.9% (523) | 10779 | 0.95 (0.81, 1.11) | 0.516 | |
| Socio-economic status | 1 | 3.8% (611) | 16131 | | | 0.665 |
| | 2 | 3.7% (462) | 12621 | 1.05 (0.92, 1.20) | 0.481 | |
| | 3 | 3.5% (226) | 6530 | 0.92 (0.78, 1.09) | 0.336 | |
| | 4 | 4.1% (154) | 3796 | 0.99 (0.82, 1.20) | 0.934 | |
| | 5 | 4.5% (115) | 2581 | 1.01 (0.81, 1.25) | 0.925 | |
| Public Health England centre | London | 3.1% (551) | 17658 | | | 2.84e-17 |
| | West Midlands | 3.8% (198) | 5217 | | | |
| | North West | 4.3% (176) | 4075 | | | |
| | South East | 3.0% (121) | 4037 | | | |
| | Yorkshire and the Humber | 6.6% (202) | 3077 | | | |
| | East of England | 3.3% (88) | 2662 | | | |
| | East Midlands | 3.2% (82) | 2548 | | | |
| | South West | 6.9% (110) | 1595 | | | |
| | North East | 5.1% (40) | 790 | | | |

Supplementary Table S4: Results from a logistic regression model with data completeness (Complete/Missing) for date of ending treatment as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. There is little evidence that the missing data for the date of ending treatment is associated with any variable considered, except for year of notification.
| Variable | Category | Missing, % (N) | Notifications (33606) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 2.9% (182) | 6171 | | | 4.89e-15 |
| | 2011 | 2.6% (177) | 6855 | 0.88 (0.71, 1.08) | 0.228 | |
| | 2012 | 2.4% (164) | 6882 | 0.78 (0.63, 0.97) | 0.0274 | |
| | 2013 | 1.5% (97) | 6298 | 0.49 (0.38, 0.63) | 3.05e-08 | |
| | 2014 | 1.2% (66) | 5341 | 0.38 (0.29, 0.51) | 5.33e-11 | |
| | 2015 | 1.4% (28) | 2059 | 0.47 (0.31, 0.69) | 0.000223 | |
| Sex | Female | 2.1% (311) | 14630 | | | 0.506 |
| | Male | 2.1% (403) | 18976 | 1.05 (0.91, 1.23) | 0.507 | |
| Age | 0-14 | 2.7% (44) | 1617 | | | 0.52 |
| | 15-44 | 2.0% (419) | 21027 | 0.81 (0.59, 1.14) | 0.209 | |
| | 45-64 | 2.3% (165) | 7272 | 0.83 (0.59, 1.20) | 0.314 | |
| | 65+ | 2.3% (86) | 3690 | 0.74 (0.50, 1.11) | 0.141 | |
| Ethnic group | White | 2.9% (176) | 6076 | | | 0.0466 |
| | Black-Caribbean | 2.8% (21) | 753 | 1.51 (0.91, 2.38) | 0.0888 | |
| | Black-African | 1.9% (114) | 6071 | 0.90 (0.66, 1.23) | 0.512 | |
| | Black-Other | 2.3% (7) | 306 | 1.34 (0.56, 2.75) | 0.464 | |
| | Indian | 1.7% (150) | 8842 | 0.72 (0.55, 0.96) | 0.0235 | |
| | Pakistani | 2.5% (140) | 5668 | 0.86 (0.65, 1.13) | 0.282 | |
| | Bangladeshi | 1.3% (18) | 1409 | 0.65 (0.37, 1.07) | 0.105 | |
| | Chinese | 2.8% (11) | 396 | 1.17 (0.58, 2.14) | 0.643 | |
| | Mixed / Other | 1.9% (77) | 4085 | 0.98 (0.70, 1.35) | 0.887 | |
| UK birth status | Non-UK Born | 1.9% (480) | 25174 | | | 0.959 |
| | UK Born | 2.8% (234) | 8432 | 1.01 (0.81, 1.25) | 0.959 | |
| Socio-economic status | 1 | 2.4% (308) | 13080 | | | 0.257 |
| | 2 | 1.7% (170) | 10266 | 1.03 (0.84, 1.26) | 0.752 | |
| | 3 | 1.9% (100) | 5265 | 1.09 (0.85, 1.38) | 0.498 | |
| | 4 | 2.8% (84) | 2994 | 1.36 (1.04, 1.76) | 0.021 | |
| | 5 | 2.6% (52) | 2001 | 1.08 (0.78, 1.47) | 0.619 | |
| Public Health England centre | London | 0.7% (100) | 14747 | | | 8.46e-59 |
| | West Midlands | 4.2% (177) | 4240 | | | |
| | North West | 2.7% (88) | 3208 | | | |
| | South East | 2.5% (79) | 3213 | | | |
| | Yorkshire and the Humber | 2.8% (67) | 2361 | | | |
| | East of England | 4.0% (83) | 2098 | | | |
| | East Midlands | 3.1% (63) | 2039 | | | |
| | South West | 2.9% (32) | 1122 | | | |
| | North East | 4.3% (25) | 578 | | | |

Date of death

Supplementary Table S5: Results from a logistic regression model with data completeness (Complete/Missing) for date of death as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that there is some evidence that date of death is missing at random for ethnic group, with weaker evidence for all other variables.
| Variable | Category | Missing, % (N) | Notifications (1883) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 16.6% (53) | 320 | | | 0.129 |
| | 2011 | 15.9% (52) | 327 | 1.02 (0.63, 1.65) | 0.938 | |
| | 2012 | 14.5% (51) | 351 | 0.88 (0.54, 1.42) | 0.593 | |
| | 2013 | 13.5% (42) | 312 | 0.70 (0.43, 1.16) | 0.169 | |
| | 2014 | 9.5% (30) | 317 | 0.55 (0.32, 0.93) | 0.0263 | |
| | 2015 | 13.3% (34) | 256 | 0.67 (0.39, 1.14) | 0.14 | |
| Sex | Female | 14.8% (97) | 657 | | | 0.569 |
| | Male | 13.5% (165) | 1226 | 0.91 (0.67, 1.25) | 0.568 | |
| Age | 0-14 | 10.0% (1) | 10 | | | 0.799 |
| | 15-44 | 15.7% (31) | 198 | 1.86 (0.26, 38.77) | 0.596 | |
| | 45-64 | 14.6% (68) | 465 | 1.85 (0.26, 38.20) | 0.598 | |
| | 65+ | 13.4% (162) | 1210 | 2.11 (0.30, 43.43) | 0.521 | |
| Ethnic group | White | 11.1% (102) | 920 | | | 0.9 |
| | Black-Caribbean | 21.7% (10) | 46 | 0.90 (0.35, 2.18) | 0.817 | |
| | Black-African | 20.1% (27) | 134 | 0.92 (0.45, 1.92) | 0.833 | |
| | Black-Other | 20.0% (1) | 5 | 0.52 (0.03, 4.31) | 0.586 | |
| | Indian | 17.4% (64) | 367 | 0.90 (0.49, 1.70) | 0.747 | |
| | Pakistani | 8.0% (20) | 249 | 0.62 (0.30, 1.29) | 0.204 | |
| | Bangladeshi | 22.7% (10) | 44 | 0.85 (0.33, 2.12) | 0.731 | |
| | Chinese | 14.3% (3) | 21 | 0.80 (0.16, 3.23) | 0.772 | |
| | Mixed / Other | 25.8% (25) | 97 | 1.15 (0.55, 2.39) | 0.711 | |
| UK birth status | Non-UK Born | 16.6% (167) | 1004 | | | 0.796 |
| | UK Born | 10.8% (95) | 879 | 1.08 (0.61, 1.92) | 0.796 | |
| Socio-economic status | 1 | 11.4% (79) | 695 | | | 0.912 |
| | 2 | 18.3% (86) | 470 | 0.87 (0.59, 1.29) | 0.499 | |
| | 3 | 16.2% (48) | 296 | 1.04 (0.66, 1.64) | 0.87 | |
| | 4 | 12.7% (30) | 237 | 1.02 (0.60, 1.71) | 0.937 | |
| | 5 | 10.3% (19) | 185 | 0.87 (0.46, 1.59) | 0.651 | |
| Public Health England centre | London | 37.6% (201) | 534 | | | 1.92e-57 |
| | West Midlands | 2.3% (7) | 305 | | | |
| | North West | 7.0% (16) | 228 | | | |
| | South East | 4.8% (10) | 208 | | | |
| | Yorkshire and the Humber | 3.6% (6) | 168 | | | |
| | East of England | 8.5% (11) | 130 | | | |
| | East Midlands | 1.9% (3) | 156 | | | |
| | South West | 6.7% (7) | 105 | | | |
| | North East | 2.0% (1) | 49 | | | |

Cause of death

Supplementary Table S6: Results from a logistic regression model with data completeness (Complete/Missing) for cause of death as an outcome, adjusted for: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ year olds), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that cause of death is missing at random for ethnic group and UK birth status, with little evidence for any other variables.
| Variable | Category | Missing, % (N) | Notifications (1883) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 45.0% (144) | 320 | | | 0.576 |
| | 2011 | 45.6% (149) | 327 | 0.99 (0.71, 1.37) | 0.944 | |
| | 2012 | 45.3% (159) | 351 | 0.94 (0.68, 1.29) | 0.694 | |
| | 2013 | 43.9% (137) | 312 | 0.94 (0.67, 1.30) | 0.693 | |
| | 2014 | 44.8% (142) | 317 | 0.86 (0.62, 1.20) | 0.379 | |
| | 2015 | 38.7% (99) | 256 | 0.74 (0.52, 1.05) | 0.0933 | |
| Sex | Female | 44.7% (294) | 657 | | | 0.763 |
| | Male | 43.7% (536) | 1226 | 0.97 (0.79, 1.19) | 0.763 | |
| Age | 0-14 | 50.0% (5) | 10 | | | 0.14 |
| | 15-44 | 35.4% (70) | 198 | 0.69 (0.17, 2.82) | 0.6 | |
| | 45-64 | 43.0% (200) | 465 | 1.02 (0.25, 4.11) | 0.977 | |
| | 65+ | 45.9% (555) | 1210 | 1.03 (0.25, 4.13) | 0.965 | |
| Ethnic group | White | 48.2% (443) | 920 | | | 0.00768 |
| | Black-Caribbean | 21.7% (10) | 46 | 0.47 (0.20, 0.99) | 0.0565 | |
| | Black-African | 45.5% (61) | 134 | 1.78 (1.04, 3.03) | 0.0347 | |
| | Black-Other | 20.0% (1) | 5 | 0.70 (0.03, 5.37) | 0.761 | |
| | Indian | 35.7% (131) | 367 | 0.87 (0.56, 1.36) | 0.545 | |
| | Pakistani | 49.4% (123) | 249 | 1.33 (0.84, 2.11) | 0.224 | |
| | Bangladeshi | 27.3% (12) | 44 | 0.82 (0.36, 1.78) | 0.625 | |
| | Chinese | 52.4% (11) | 21 | 1.70 (0.64, 4.55) | 0.284 | |
| | Mixed / Other | 39.2% (38) | 97 | 1.37 (0.78, 2.41) | 0.275 | |
| UK birth status | Non-UK Born | 40.1% (403) | 1004 | | | 0.426 |
| | UK Born | 48.6% (427) | 879 | 1.17 (0.79, 1.74) | 0.427 | |
| Socio-economic status | 1 | 43.7% (304) | 695 | | | 0.168 |
| | 2 | 40.0% (188) | 470 | 1.26 (0.97, 1.64) | 0.0842 | |
| | 3 | 42.9% (127) | 296 | 1.20 (0.89, 1.63) | 0.235 | |
| | 4 | 49.8% (118) | 237 | 1.43 (1.03, 1.98) | 0.0322 | |
| | 5 | 50.3% (93) | 185 | 1.37 (0.96, 1.97) | 0.0841 | |
| Public Health England centre | London | 25.3% (135) | 534 | | | 1.1e-20 |
| | West Midlands | 48.9% (149) | 305 | | | |
| | North West | 61.8% (141) | 228 | | | |
| | South East | 46.6% (97) | 208 | | | |
| | Yorkshire and the Humber | 44.0% (74) | 168 | | | |
| | East of England | 46.2% (60) | 130 | | | |
| | East Midlands | 60.3% (94) | 156 | | | |
| | South West | 53.3% (56) | 105 | | | |
| | North East | 49.0% (24) | 49 | | | |

Assessing potential biases in reporting

Supplementary Figure S2: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. This heaping on the first day of the month provides some evidence of recall bias. Data is shown only from 2001 to 2015, as this variable was not recorded before 2001; data for 2015 is incomplete.
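
The comparison described in panel b.) can be sketched as a simple heaping check. This is a minimal Python illustration using synthetic dates, not ETS data: the year, the number of heaped records and the variable names are all hypothetical. It compares the observed share of dates falling on the first of the month with the share expected if dates were spread evenly across the year.

```python
from collections import Counter
from datetime import date

# Hypothetical treatment end dates: one per day of 2014, plus extra
# records heaped on the first of each month to mimic rounded recall.
start = date(2014, 1, 1).toordinal()
dates = [date.fromordinal(start + i) for i in range(365)]
dates += [date(2014, month, 1) for month in range(1, 13) for _ in range(5)]

day_counts = Counter(d.day for d in dates)
observed_first = day_counts[1] / len(dates)

# With no heaping, the first of the month accounts for 12 of the 365
# days in 2014.
expected_first = 12 / 365

heaping_ratio = observed_first / expected_first
print(f"observed {observed_first:.3f}, expected {expected_first:.3f}, "
      f"ratio {heaping_ratio:.1f}")
```

A ratio well above 1 indicates more first-of-month dates than an even spread would produce. On its own this does not distinguish recall bias from other explanations, such as administrative defaults when only the month is known.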

Supplementary Figure S3: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. This heaping on the first day of the month provides some evidence of recall bias. Data is shown only from 2001 to 2015, as this variable was not recorded before 2001; data for 2015 is incomplete.

Supplementary Figure S4: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. This heaping on the first day of the month provides some evidence of recall bias. Data is shown only from 2001 to 2015, as this variable was not recorded before 2001; data for 2015 is incomplete.

Supplementary Figure S5: a.) The proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. b.) The proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. This heaping on the first day of the month provides some evidence of recall bias. Data is shown only from 2001 to 2015, as this variable was not recorded before 2001; data for 2015 is incomplete.